
PERF: fix SparseArray._simple_new object initialization #32821


Merged


jorisvandenbossche
Member

@jorisvandenbossche jorisvandenbossche commented Mar 19, 2020

Apart from being more idiomatic, this also avoids constructing a SparseArray from an empty list through the normal machinery (including validation of the input, etc.).

With this PR:

In [1]: data = np.array([1, 2, 3], dtype=float)  

In [2]: index = pd.core.arrays.sparse.IntIndex(5, np.array([0, 2, 4]))  

In [3]: dtype = pd.SparseDtype("float64", 0)      

In [4]: pd.arrays.SparseArray._simple_new(data, index, dtype)  
Out[4]: 
[1.0, 0, 2.0, 0, 3.0]
Fill: 0
IntIndex
Indices: array([0, 2, 4], dtype=int32)

In [5]: %timeit pd.arrays.SparseArray._simple_new(data, index, dtype)    
381 ns ± 4.83 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

On the released version, the same call takes around 50 µs (roughly 100x slower).
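For readers unfamiliar with the pattern, here is a minimal sketch of the general idea, not the actual pandas code. `ValidatedArray` and both classmethods are hypothetical stand-ins; the sketch only assumes that the slow path comes from building a throwaway instance through `__init__`, and that the fast path allocates with `object.__new__` and sets attributes directly.

```python
import timeit


class ValidatedArray:
    """Hypothetical stand-in for an array class with a costly constructor."""

    def __init__(self, values):
        # The "normal machinery": copy the input and validate every element.
        values = list(values)
        if not all(isinstance(v, (int, float)) for v in values):
            raise TypeError("values must be numeric")
        self._values = values

    @classmethod
    def _simple_new_via_init(cls, values):
        # Slower pattern: run __init__ on an empty list, then overwrite the result.
        new = cls([])
        new._values = values
        return new

    @classmethod
    def _simple_new(cls, values):
        # Faster pattern: allocate the instance without calling __init__ at all.
        new = object.__new__(cls)
        new._values = values
        return new


data = [1.0, 2.0, 3.0]
fast = ValidatedArray._simple_new(data)
slow = ValidatedArray._simple_new_via_init(data)

# Both paths produce equivalent objects; only the construction cost differs.
print(fast._values == slow._values)
print(timeit.timeit(lambda: ValidatedArray._simple_new(data), number=100_000))
print(timeit.timeit(lambda: ValidatedArray._simple_new_via_init(data), number=100_000))
```

The gap in this toy example is much smaller than the one reported above, since the real constructor does far more work than this stand-in `__init__`.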

Noticed while investigating #32196

@jorisvandenbossche jorisvandenbossche added Performance Memory or execution speed performance Sparse Sparse Data Type labels Mar 19, 2020
@jorisvandenbossche jorisvandenbossche added this to the 1.1 milestone Mar 19, 2020
@rth
Contributor

rth commented Mar 19, 2020

Thanks @jorisvandenbossche! Quick benchmark results below, from running pd.DataFrame.sparse.from_spmatrix on a random sparse CSR matrix with the given n_samples and n_features and density=0.01:

label                 master (s)    PR (s)
n_samples n_features                   
100       100000         14.5374   10.1247
10000     10000           1.5599    1.1037
100000    100             0.0180    0.0134

So overall this cuts the run time of that method by roughly 25-30%.
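As a quick check of that figure, the relative reduction in run time can be computed directly from the table. The numbers are copied from the timings above; the helper name `relative_reduction` is only illustrative.

```python
# Timings copied from the table above:
# (n_samples, n_features) -> (master seconds, PR seconds)
timings = {
    (100, 100_000): (14.5374, 10.1247),
    (10_000, 10_000): (1.5599, 1.1037),
    (100_000, 100): (0.0180, 0.0134),
}


def relative_reduction(before, after):
    """Fraction of the original run time saved by the PR."""
    return (before - after) / before


reductions = {shape: relative_reduction(*t) for shape, t in timings.items()}
for shape, r in reductions.items():
    print(f"{shape}: {r:.1%} less time")
```

The three cases land between about 26% and 30%, with the largest saving on the widest matrix.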

Member

@simonjayhawkins simonjayhawkins left a comment


Thanks @jorisvandenbossche lgtm

@simonjayhawkins simonjayhawkins merged commit 34f3360 into pandas-dev:master Mar 19, 2020
@jorisvandenbossche jorisvandenbossche deleted the sparse-simple-new branch March 19, 2020 11:43
@jorisvandenbossche
Member Author

@rth thanks for the timings! Yes, this was indeed a large part of why the original from_spmatrix was slow; the snippet in the issue takes care of most of the rest.

@jreback
Contributor

jreback commented Mar 19, 2020

Do we have asvs for this? Also, please add a whatsnew note.

@rth
Contributor

rth commented Mar 19, 2020

Note added in #32825; that should work for both PRs, I think.

SeeminSyed pushed a commit to CSCD01-team01/pandas that referenced this pull request Mar 22, 2020
jbrockmendel pushed a commit to jbrockmendel/pandas that referenced this pull request Mar 23, 2020
jbrockmendel pushed a commit to jbrockmendel/pandas that referenced this pull request Mar 25, 2020
Labels
Performance Memory or execution speed performance Sparse Sparse Data Type

4 participants